Poster: tfmorris Date: Feb 11, 2014 8:56am
Forum: texts Subject: Re: Using Tesseract to improving OCR for some languages

Does anyone look at improving OCR quality on an ongoing basis, whether through the use of Tesseract or other means? Is OCR ever re-done after the initial pass?

I recently did a study scoring the OCR quality of public domain eBooks in IA and found the quality to be all over the map. I suspect that Tesseract could do a better job in many cases, but I also suspect that ABBYY could be improved.

I saw some anecdotal evidence that high processing loads on the OCR cluster and the use of "fast mode" were correlated with low OCR quality. That seems fine as an interim measure if the books were later requeued for full processing, but that doesn't seem to happen.

Like Nick, I'd be willing to help with improving the OCR quality.

Tom